EA Forum Podcast (All audio)

[Linkpost] “Q2 AI Benchmark Results: Pros Maintain Clear Lead” by Benjamin Wilson 🔸, johnbash, Metaculus

Update: 2025-10-28
Description

This is a link post.

By Ben Wilson and John Bash from Metaculus

Main Takeaways

Top Findings

  • Pro forecasters significantly outperform bots: Our team of 10 Metaculus Pro Forecasters demonstrated superior performance compared to the top-10 bot team, with strong statistical significance (p = 0.00001) based on a one-sided t-test on Peer scores.
  • The bot team did not improve significantly in Q2 relative to the human Pro team: The bot team's head-to-head score against Pros was -11.3 in Q3 2024 (95% CI: [-21.8, -0.7]), then -8.9 in Q4 2024 (95% CI: [-18.8, 1]), then -17.7 in Q1 2025 (95% CI: [-28.3, -7.0]), and now -20.03 in Q2 2025 (95% CI: [-28.63, -11.41]), with no clear trend emerging. (Reminder: a lower head-to-head score indicates worse relative accuracy. A score of 0 corresponds to equal accuracy.)
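
As a rough illustration of the statistics quoted above, the sketch below summarizes a list of per-question scores with a mean, a t statistic against zero, a 95% confidence interval, and a one-sided p-value. Everything here is an assumption for illustration: the function name `summarize_scores` is hypothetical, the sample scores are made up, and the interval and p-value use a normal approximation to stay within the Python standard library, whereas the post reports a one-sided t-test (for which something like `scipy.stats.ttest_1samp` with `alternative="less"` would be the closer match). It is not the authors' analysis code.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize_scores(scores, confidence=0.95):
    """Summarize per-question scores (hypothetical helper, not from the post).

    Returns the mean, the t statistic against a null mean of 0, and a
    normal-approximation confidence interval and one-sided p-value.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    avg = mean(scores)
    sem = stdev(scores) / math.sqrt(n)  # standard error of the mean
    if sem == 0:
        raise ValueError("scores have zero variance")
    t_stat = avg / sem
    # Normal approximation: z critical value instead of a t critical value.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (avg - z * sem, avg + z * sem)
    # One-sided alternative: the true mean score is below 0.
    p_one_sided = NormalDist().cdf(t_stat)
    return {"mean": avg, "t_stat": t_stat, "ci": ci, "p_one_sided": p_one_sided}


# Made-up scores purely for illustration; not data from the tournament.
example = summarize_scores([-30.0, 5.0, -12.0, -45.0, 8.0, -20.0])
print(example)
```

With small samples the normal approximation gives a narrower interval than a t-based one, so a real analysis would swap in the t-distribution.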

Other Takeaways

  • This quarter's winning bot is open-source: Q2 Winner Panshul has very generously made his bot open-source. The bot writes separate “outside view” and “inside view” [...]

---

Outline:

(00:20) Main Takeaways
(03:24) Introduction
(04:30) Methodology
(13:59) How do LLMs Compare?
(17:18) Which Bot Strategy is Best?
(23:04) Are Bots Better than Human Pros?
(25:38) Binary vs Numeric vs Multiple Choice Questions
(27:07) Team Performance Over Quarters
(31:14) Bot Maker Survey
(31:40) Best practices of the best-performing bots
(38:27) Other Survey Results
(41:32) How did scaffolding do?
(45:33) Advice from Bot Makers
(53:48) Links to Code and Data
(54:56) Future AI Benchmarking Tournaments

---


First published: October 28th, 2025

Source: https://forum.effectivealtruism.org/posts/F2stjK9wHSy3HPEC9/q2-ai-benchmark-results-pros-maintain-clear-lead

Linkpost URL: https://www.metaculus.com/notebooks/40456/q2-ai-benchmark-results/


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Bar graph showing average scores, ranging from +20 to -20, color-coded.
Leaderboard table showing performance rankings of 10 different AI models/bots.
A histogram showing score distribution, with peak values near zero.
Pie chart showing development hours spent on team bot projects from 19 responses.
Tournament results table showing rankings and scores for 19 different bots.
Bar graph comparing scores across different MetaC and GeminiAI model variants.
Comparison table showing metrics for Aggregation, Manual review, Custom Questions, and AskNews.
Ranking table showing performance scores of various AI models, with metac-03+asknews leading.